18.1 The Log-Likelihood Gradient

Target probability distribution:

\[p(x; \theta) = \frac{1}{Z(\theta)}\hat{p}(x; \theta)\]

The gradient of the log-likelihood decomposes into two terms:

\[\nabla_{\theta} \log p(x;\theta) = \nabla_{\theta} \log\hat{p}(x;\theta) - \nabla_{\theta} \log Z(\theta)\]
  • \(\nabla_{\theta} \log\hat{p}(x;\theta)\) : Positive phase. Models with no latent variables, or with few interactions among the latent variables, typically have a tractable positive phase.
  • \(\nabla_{\theta} \log Z(\theta)\): Negative phase. For undirected models, the negative phase is difficult.

A closer look at the gradient of \(\log Z\), assuming \(p(x) > 0\) for all \(x\):

\[\begin{split}\nabla_{\theta} \log Z &= \frac{\nabla_{\theta} Z}{Z} \\ &= \frac{\nabla_{\theta}\sum_x \hat{p}(x)}{Z} \\ &= \frac{\sum_x \nabla_{\theta} \hat{p}(x)}{Z} \\ &= \frac{\sum_x \hat{p}(x) \frac{\nabla_{\theta} \hat{p}(x)}{\hat{p}(x)} }{Z} \\ &= \frac{\sum_x \hat{p}(x) \nabla_{\theta} \log \hat{p}(x) }{Z} \\ &= \sum_x p(x) \nabla_{\theta} \log \hat{p}(x) \\ &= E_{x\sim p(x)} \nabla_{\theta} \log\hat{p}(x)\end{split}\]
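The identity \(\nabla_{\theta} \log Z = E_{x\sim p(x)} \nabla_{\theta} \log\hat{p}(x)\) can be checked numerically. The sketch below uses a hypothetical toy model (not from the text): \(\hat{p}(x;\theta) = e^{\theta x}\) over the finite domain \(x \in \{0,1,2,3\}\), so both sides are exactly computable.

```python
import numpy as np

# Hypothetical toy model: unnormalized p_hat(x; theta) = exp(theta * x)
# over the finite domain {0, 1, 2, 3}, so Z(theta) = sum_x exp(theta * x).
xs = np.arange(4)
theta = 0.7

def log_p_hat(x, th):
    # log p_hat(x; theta) = theta * x, so its gradient w.r.t. theta is x
    return th * x

Z = np.sum(np.exp(log_p_hat(xs, theta)))
p = np.exp(log_p_hat(xs, theta)) / Z  # normalized p(x; theta)

# Left side: gradient of log Z via finite differences
eps = 1e-6
Z_plus = np.sum(np.exp(log_p_hat(xs, theta + eps)))
grad_log_Z = (np.log(Z_plus) - np.log(Z)) / eps

# Right side: E_{x ~ p}[grad_theta log p_hat(x)] = E_p[x] for this model
expected_grad = np.sum(p * xs)

print(grad_log_Z, expected_grad)  # the two sides agree
```

For this exponential-family form, both sides reduce to \(E_p[x]\), which is why the check succeeds.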

This is the basis for a variety of Monte Carlo methods for approximately maximizing the likelihood of models with intractable partition functions.

  • In the positive phase, we increase \(\log \hat{p}(x)\) for \(x\) drawn from the data.
  • In the negative phase, we decrease the partition function by decreasing \(\log \hat{p}(x)\) for \(x\) drawn from the model distribution.
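The two phases can be sketched as a single stochastic gradient step: push up \(\nabla_{\theta}\log\hat{p}\) averaged over data, push it down averaged over model samples. The toy model and data below are hypothetical, assuming the same \(\hat{p}(x;\theta)=e^{\theta x}\) form as before; the tiny domain lets us sample from the model exactly, which a real undirected model cannot do.

```python
import numpy as np

rng = np.random.default_rng(1)
xs = np.arange(4)                        # hypothetical finite domain {0, 1, 2, 3}
theta = 0.0

def grad_log_p_hat(x, th):
    # For log p_hat(x; theta) = theta * x, the gradient w.r.t. theta is x
    return x

data = np.array([2, 3, 3, 2, 3])         # hypothetical observed samples

for _ in range(200):
    # model distribution under current theta (exact, since the domain is tiny)
    p = np.exp(theta * xs)
    p /= p.sum()
    model_samples = rng.choice(xs, size=500, p=p)
    positive = grad_log_p_hat(data, theta).mean()           # positive phase: data
    negative = grad_log_p_hat(model_samples, theta).mean()  # negative phase: model
    theta += 0.1 * (positive - negative)
```

At the fixed point the model's expected sufficient statistic \(E_p[x]\) matches the data mean, which is the maximum-likelihood condition for this family.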

Review of Monte Carlo methods:

The idea: view the sum or integral as an expectation under some distribution, and approximate that expectation by a corresponding sample average. The sum or integral to estimate:

\[\begin{split}s = \sum_x p(x)f(x) = E_p[f(x)] \\ \text{or} \\ s = \int p(x)f(x)\,dx = E_p[f(x)]\end{split}\]

We can approximate \(s\) by drawing \(n\) samples \(x^{(1)}, x^{(2)}, \dots, x^{(n)}\) from \(p\) and then forming the empirical average:

\[\hat{s}_n = \frac{1}{n} \sum_{i=1}^{n}f(x^{(i)})\]
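A minimal sketch of this estimator, under an assumed example not taken from the text: \(p = \mathcal{N}(0,1)\) and \(f(x) = x^2\), for which the true value is \(s = E_p[x^2] = 1\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical example: estimate s = E_p[f(x)] with p = N(0, 1), f(x) = x^2.
# The true value is Var(x) + E[x]^2 = 1.
samples = rng.normal(size=n)     # x^(1), ..., x^(n) drawn from p
s_hat = (samples ** 2).mean()    # empirical average of f(x^(i))

print(s_hat)  # close to 1 for large n
```

The estimator is unbiased, and by the law of large numbers its error shrinks like \(1/\sqrt{n}\), so larger \(n\) gives a tighter estimate.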